Abstract

In this paper we propose and examine gap statistics for assessing uniform distribution hypotheses. We provide examples
relevant to data integrity testing for which max-gap statistics provide greater sensitivity than chi-square ($\chi^2$), thus allowing the
new test to be used in place of, or as a complement to, $\chi^2$ testing for purposes of distinguishing a larger class of deviations from
uniformity. We establish that the proposed max-gap test has the same sequential and parallel computational complexity as $\chi^2$
and thus is applicable for Big Data analytics and integrity verification.
Introduction

Distribution testing is a fundamental statistical problem that arises in a wide range of practical applications. At its 
core, the problem is to assess whether a dataset that is assumed to comprise samples from a known probability 
distribution is in fact consistent with that assumption. For example, if the end state of a computer simulation of a 
physical system is a set of points with an expected physics-prescribed distribution, then any detected deviation 
from that expected distribution could undermine confidence in the results obtained and possibly in the integrity of 
the simulation system itself. 

Data integrity verification is a related application for distribution testing in which the objective is to detect 
evidence of tampering, e.g., human-altered data. For example, many sources of numerical data produce numbers 
with first digits conforming to the Benford-Newcomb first-digit distribution (this phenomenon is often referred to
as "Benford's Law") [1, 2], while digits other than the first and last are uniformly sampled from {0,...,9} [3]. Digits in
human-created numbers, by contrast, tend to exhibit high regularity with all elements of {0,...,9} represented with 
nearly equal cardinality. Statistically identified deviations of this kind have been used to uncover acts of scientific 
misconduct and accounting fraud [4, 5, 6, 7, 8, 9], but there is an increasing need for higher-sensitivity tests. 

There is of course no way to make an unequivocal binary assessment of whether a dataset of samples conforms to a 
given distribution assumption, but it is possible to devise statistical tests which can assign a rigorous likelihood 
estimate to the hypothesis that the dataset does (or does not) represent samples from the assumed distribution. In 
this paper we briefly review the most widely-used method for distribution testing, the chi-square ($\chi^2$) test, and
then develop alternative tests based on the statistics of gap widths between data items of consecutive rank. Our
principal contribution is a max-gap test which is shown to provide superior sensitivity to regularity deviations
from a uniform distribution that are relevant to data integrity testing [10, 11, 12]. We show that this test can be
evaluated with the same optimal computational complexity (serial and parallel) as the conventional $\chi^2$ test and is
therefore suitable for extremely large-scale datasets.

Chi-square Test 

The $\chi^2$ test is a statistical measure that can be applied to a discrete dataset to assess the hypothesis that its
elements were sampled from a particular distribution. More specifically, it is a histogram-based method to measure
the goodness-of-fit between the observed frequency distribution and the expected (theoretical) frequency
distribution. The general procedure of the test includes the following steps:

1. Calculate the chi-square statistic, $\chi^2$, which is a normalized sum of squared differences (deviations)
between observed and expected frequencies.

2. Determine the degrees of freedom, df, of that statistic, which is essentially the number of frequencies
reduced by the number of parameters of the fitted distribution.

3. Compare $\chi^2$ with the critical value for the chi-square distribution with df degrees of freedom.



FIG. 1 COMPLEMENT OF THE CUMULATIVE DISTRIBUTION FUNCTION OF THE $\chi^2$ DISTRIBUTION, SHOWING $\chi^2$ (PEARSON'S CUMULATIVE TEST STATISTIC) ON THE x-AXIS AND P-VALUE ON THE y-AXIS [13]

An example of the complement of the cumulative distribution function of the $\chi^2$ distribution is shown in Fig. 1
with different degrees-of-freedom values. For uniformity testing, the procedure can be expressed as follows:

1. Given N observations, construct an N-bin histogram. Let $b_i$ be the bin count for the $i$th bin ($i = 1, \ldots, N$),
which is the observed frequency distribution. As we are testing for uniformity, the expected frequency
distribution is $e_i = 1$, $\forall i = 1, \ldots, N$.


2. Compute the chi-square test statistic:

$$\chi^2 = \sum_{i=1}^{N} \frac{(b_i - e_i)^2}{e_i} = \sum_{i=1}^{N} (b_i - 1)^2 \qquad (1)$$


3. The number of degrees of freedom, df, is N - 1 for this case because the counts for N - 1 bins uniquely
determine the count for the remaining bin.

4. Compute the complement of the cumulative distribution function of the $\chi^2$ distribution with $\chi^2$ and df
obtained from the previous steps. Compare this value with the significance level $\alpha$ for the test result.
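The four steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming samples lie in [0, 1) and N >= 2; the function name `chi_square_uniformity` is hypothetical, and the tail probability uses the Wilson-Hilferty normal approximation (an assumption made here to avoid third-party dependencies, not something prescribed by the text).

```python
import math
from statistics import NormalDist


def chi_square_uniformity(samples):
    """Chi-square statistic, df, and approximate p-value for samples in [0, 1)."""
    n = len(samples)
    # Step 1: N-bin histogram; bin i covers [i/N, (i+1)/N).
    counts = [0] * n
    for x in samples:
        counts[min(int(x * n), n - 1)] += 1
    # Step 2: with expected count e_i = 1 the statistic is sum (b_i - 1)^2.
    chi2 = sum((b - 1) ** 2 for b in counts)
    # Step 3: degrees of freedom.
    df = n - 1
    # Step 4 (assumption): Wilson-Hilferty approximation to the upper tail
    # of the chi-square distribution, standing in for an exact evaluation.
    scale = 2.0 / (9.0 * df)
    z = ((chi2 / df) ** (1.0 / 3.0) - (1.0 - scale)) / math.sqrt(scale)
    p_value = 1.0 - NormalDist().cdf(z)
    return chi2, df, p_value
```

The returned p-value would then be compared with the chosen significance level; a library routine for the chi-square tail could replace the approximation where one is available.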

Despite being the de-facto standard for assessing dataset consistency with respect to a given distribution
assumption, the $\chi^2$ test is not optimally sensitive to the types of deviation from uniformity that arise in many data
integrity applications. One example involves narrow-band missing data resulting from a corrupted sensor or
measurement process. Another example involves data that are generated from a non-random process and exhibit
a higher degree of regularity than is expected for a uniform distribution [14, 15]. Datasets of the latter kind are
typical of artificial and human-generated data, e.g., as in a forged dataset that has been tailored to include
deviations that qualitatively resemble (to humans) uniform random deviates. In the following section we
demonstrate the advantage of the proposed max-gap test over $\chi^2$ for narrow-band and high-regularity deviations
from uniformity.


Max-gap Test 

The maximum gap, or max-gap, for a dataset of real values is defined as the maximum difference between
elements of consecutive rank, which can be determined from a sorted ordering of the dataset. The distribution of
spacings between consecutive-rank items in a dataset has been examined in the literature [16, 17, 18, 19], and we
summarize here some of the results relevant to gap analysis. Assume we are given N - 1 observations on the open
unit interval (0, 1) which divide the interval into N intervals whose lengths in ascending order are denoted by
$S_{(1)} < S_{(2)} < \cdots < S_{(N)}$. For uniformity testing, we are interested in $S_{(N)}$, as it is the max-gap of the observations. The
exact distribution of $S_{(N)}$ is [19]:




$$P(S_{(N)} < x) = \sum_{v=0}^{N} (-1)^v \binom{N}{v} (1 - vx)_+^{N-1} \qquad (2)$$

where $a_+ = \max(a, 0)$.
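The exact distribution can be evaluated directly for small N. The sketch below assumes the classical alternating-sum form with a binomial coefficient, $\sum_{v=0}^{N} (-1)^v \binom{N}{v} (1 - vx)_+^{N-1}$, and uses exact rational arithmetic because the alternating terms cancel heavily in floating point; the function name `max_gap_cdf_exact` is hypothetical.

```python
from fractions import Fraction
from math import comb


def max_gap_cdf_exact(x, n_intervals):
    """P(S_(N) < x) for N = n_intervals >= 2, evaluated in exact arithmetic."""
    x = Fraction(x)
    total = Fraction(0)
    for v in range(n_intervals + 1):
        base = 1 - v * x
        if base <= 0:
            break  # (1 - v x)_+ is zero for this and every larger v
        total += (-1) ** v * comb(n_intervals, v) * base ** (n_intervals - 1)
    return total
```

For example, with N = 3 intervals and x = 1/2 the function returns 1/4, which matches the familiar probability that a stick broken at two random points yields three pieces that can form a triangle. The cost of the exact sum grows quickly with N, which is one practical motivation for an asymptotic form.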

From the p-value of the max-gap $S_{(N)}$, denoted by p, we can perform a max-gap test for uniformity by checking
the condition $p > \alpha$ for the one-sided test, or $1 - \frac{\alpha}{2} > p > \frac{\alpha}{2}$ for the two-sided test, where $\alpha$ is the significance level.

When N is large, we may replace computation of the exact cumulative distribution of the max-gap in Eqn. 2 with 
the following asymptotic result [19]: 




$$P(S_{(N)} < x) \overset{N \to \infty}{=} e^{-e^{\ln N - Nx}} \qquad (3)$$

where the expected value of $S_{(N)}$ is

$$E[S_{(N)}] \overset{N \to \infty}{=} \frac{\gamma + \ln N}{N} \qquad (4)$$

where $\gamma$ is Euler's constant.
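As a rough numerical illustration (N = 10,000 is chosen here only as an example value), the expected max-gap and the asymptotic distribution evaluated at that expected value work out to approximately:

```latex
E[S_{(N)}] \approx \frac{\gamma + \ln N}{N}
           = \frac{0.5772 + 9.2103}{10^{4}}
           \approx 9.79 \times 10^{-4},
\qquad
P\left(S_{(N)} < E[S_{(N)}]\right)
           \approx e^{-e^{\ln N - (\gamma + \ln N)}}
           = e^{-e^{-\gamma}}
           \approx 0.570
```

Under this approximation, a max-gap equal to its expected value sits slightly above the median of the limiting distribution, and the second quantity does not depend on N.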


An efficient max-gap test for uniformity can then be formalized as follows: Given N - 1 observations $x_i$ and a
significance level $\alpha$, compute the max-gap $S_{(N)}$ of $\{0, 1\} \cup \{x_i\}$. Next, the p-value of the statistic is calculated as:


$$p = 1 - e^{-e^{\ln N - N S_{(N)}}} \qquad (5)$$


If the p-value satisfies $p > \alpha$ for the one-sided test, or $1 - \frac{\alpha}{2} > p > \frac{\alpha}{2}$ for the two-sided test, the observations are
deemed to pass the test. Otherwise, the set of observations is assessed to be inconsistent with a uniform-sampling
hypothesis and fails the test.
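The procedure just described can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: it takes the p-value to be the complement of the asymptotic max-gap distribution, $p = 1 - e^{-e^{\ln N - N S_{(N)}}}$, it finds the max-gap by sorting for simplicity, and the function name `max_gap_test` and the default significance level are hypothetical.

```python
import math


def max_gap_test(observations, alpha=0.05):
    """One-sided max-gap test on (0, 1); returns (max_gap, p_value, passed)."""
    # Include the interval endpoints so the boundary gaps are counted.
    points = sorted([0.0, 1.0] + list(observations))
    n = len(points) - 1  # N intervals produced by N - 1 observations
    max_gap = max(b - a for a, b in zip(points, points[1:]))
    # p = 1 - exp(-exp(ln N - N * S_(N))); expm1 keeps very small
    # p-values from being rounded to zero.
    p_value = -math.expm1(-math.exp(math.log(n) - n * max_gap))
    return max_gap, p_value, p_value > alpha
```

A dataset whose points all fall in the lower half of the interval leaves a gap of about 0.5 and receives a p-value far below any usual significance level, so it fails the one-sided test.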

In the next section, we present results of experiments comparing the relative sensitivities of the $\chi^2$ test and the
max-gap test for, e.g., identifying anomalous regularity in a presumed-uniform distribution.


Experiments 

In this section, we compare the max-gap test with the most well-known and commonly used test, the $\chi^2$ test. We
conducted four experiments involving datasets of N = 10,000 samples, with the result for each experiment obtained
as an average of one million independent tests. Sensitivity is assessed by comparing the respective p-values for the
one-sided forms of the two tests, where smaller values indicate greater sensitivity. The first experiment was
performed using a dataset of samples from a true uniform distribution. As expected, the dataset passed both tests
for uniformity with p = 0.5.

The second experiment examined sensitivity to the difference between a uniform distribution and a normal
distribution with standard deviation $\sigma$ sampled within a fixed interval (0, 1). The distinctive shape of the normal
distribution is realized within the interval when $\sigma$ is small but flattens with increasing values and approaches
uniformity. Both tests are equally sensitive for small $\sigma$, and both approach p = 0.5 for large $\sigma$, but the $\chi^2$ test
exhibits higher sensitivity for intermediate values (see Fig. 2). The latter is not surprising because the $\chi^2$ test is
ideally sensitive to deviations from normality.
Significance Level versus Variance

INTERVAL. WHEN THE STANDARD DEVIATION ($\sigma$) IS SMALL, BOTH TESTS EASILY IDENTIFY THE DATA'S NON-UNIFORMITY. AS $\sigma$ INCREASES, THE DATA DISTRIBUTION APPROACHES UNIFORMITY WITHIN THE SAMPLE INTERVAL AND HENCE THE P-VALUES CONVERGE TO 0.5. THIS IS AN EXAMPLE IN WHICH THE $\chi^2$ TEST PROVIDES INHERENTLY GREATER SENSITIVITY THAN THE MAX-GAP TEST

The third experiment examined sensitivity of the two tests to a uniform distribution with a narrow-band exclusion
(Fig. 3). This of course is a problem for which the max-gap test is ideally suited, and Eqn. 4 reveals that superior
sensitivity. What is possibly most interesting about the results is that the $\chi^2$ test provides only modest
sensitivity even as the exclusion width approaches one percent of the distribution window.
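To make the comparison with the expected max-gap concrete, the following sketch generates uniform samples with a narrow excluded band and measures the resulting max-gap. The band position (0.4), band width (0.5 percent of the interval), seed, and helper names are all illustrative assumptions and are not taken from the experiment itself.

```python
import math
import random


def sample_with_exclusion(n_samples, band_start, band_width, rng):
    """Uniform samples on [0, 1) with none in [band_start, band_start + band_width)."""
    samples = []
    while len(samples) < n_samples:
        x = rng.random()
        # Rejection sampling: discard any point that lands in the band.
        if not (band_start <= x < band_start + band_width):
            samples.append(x)
    return samples


def max_gap(samples):
    """Largest spacing between consecutive-rank points, endpoints included."""
    points = sorted([0.0, 1.0] + list(samples))
    return max(b - a for a, b in zip(points, points[1:]))


rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 10_000
# Expected max-gap for uniform data, (gamma + ln N) / N, using the sample
# count as N (the off-by-one in N is negligible at this size).
expected_gap = (0.5772156649 + math.log(n)) / n
gap = max_gap(sample_with_exclusion(n, 0.4, 0.005, rng))
```

Because no sample can fall inside the band, the observed max-gap is at least the band width, which here is roughly five times the value expected for uniform data of this size.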


Significance Level versus Missing Width

FIG. 3 P-VALUES OF THE $\chi^2$ TEST AND THE MAX-GAP TEST FOR NARROW-BAND MISSING DATA. IN THIS CASE THE MAX-GAP TEST PROVIDES INHERENTLY GREATER SENSITIVITY


Significance Level versus Number of Bins

FIG. 4 P-VALUES OF THE $\chi^2$ TEST AND THE MAX-GAP TEST FOR HIGH REGULARITY DATA. REGULARITY FOR A DATASET OF SIZE N IS PARAMETERIZED BY A NUMBER OF BINS K WITH N/K UNIFORM SAMPLES WITHIN EACH OF K EQUAL BINS. THUS K = 1 GENERATES A UNIFORM DISTRIBUTION AND INCREASING K APPROACHES REGULAR SPACING. THE MAX-GAP TEST AGAIN DEMONSTRATES GREATER SENSITIVITY


The fourth experiment is the most relevant to data integrity applications. It examined sensitivity to regularity in
sample spacing. Anomalous distribution regularity is a common characteristic of human-altered data because
people typically underestimate the degree of natural "clustering" that is present in data sampled from a truly
uniform distribution. As a consequence, human-created or human-altered data tend to have higher regularity, i.e.,
tend to be "more evenly distributed", than what is expected for uniformly-distributed data. More generally,
high-regularity deviations from uniformity can arise from the unanticipated influence of a structured or non-random
process, e.g., frequency-combing effects from a physical sensor or simulation artifacts resulting from a low-quality
pseudorandom number generator.
A regularity parameter $1 \le k \le N$ was used for this experiment by uniformly distributing N / k samples within each
of k equal-width subdivisions of the distribution interval. Thus, k = 1 represents a uniform sampling over the entire
interval and produces a uniform distribution; and as k increases to N, the spacing between samples becomes
increasingly regular. Although uniform and high-regularity distributions are difficult for humans to distinguish
visually, Fig. 4 shows that the max-gap test provides significantly higher sensitivity than $\chi^2$ to subtle regularity
deviations from uniformity.
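A dataset with regularity parameter k can be generated as in the sketch below. It assumes that k divides N evenly, and the function name `regular_samples` is hypothetical; the experiment's actual generator is not described in more detail than the paragraph above.

```python
import random


def regular_samples(n, k, rng):
    """Place n / k uniform samples inside each of k equal-width bins of [0, 1)."""
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n")
    per_bin = n // k
    width = 1.0 / k
    samples = []
    for j in range(k):
        # Bin j covers [j / k, (j + 1) / k).
        samples.extend((j + rng.random()) * width for _ in range(per_bin))
    return samples
```

With k = 1 this reduces to ordinary uniform sampling, while k = N places exactly one sample in each bin, so that neighbouring samples can never be more than two bin widths apart.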


Min-Gap Test 

The one-sided variants of the max-gap and $\chi^2$ tests were used because they provide a practical balance between
high sensitivity and low false alarm rates, but the one-sided or two-sided form of either test may provide the optimal
trade-off for the needs of a particular application. In some applications, the optimal trade-off might be
obtained from a min-gap, $S_{(1)}$, test. The approximate distribution of the min-gap is given by [19]


N-> oo 

P(S ( 1) ^x) = e 


\nN-Nx 



(6) 


and its expected value is [19] 


N—> oo 




y + lnN + J'- 


i = 1 


N 


(7) 


A min-gap test can be defined and performed analogously to the max-gap test and would be ideally suited for 
detecting spuriously-replicated data items. However, several simpler non-statistical methods can be applied to 
detect replicated data, so the potential applications of the min-gap test may be somewhat more limited than the 
max-gap test. 


Computational Considerations 

In terms of computational complexity, both $\chi^2$ and max-gap tests can be evaluated in optimal O(N) time and O(N)
space. This complexity is achieved for max-gap by the use of the Gonzalez algorithm [20, 21] to determine the
max-gap in linear time without sorting. The Gonzalez algorithm performs a special binning which guarantees by the
pigeonhole principle that the max-gap data items will be found as the maximum and minimum values, respectively,
in consecutive non-empty bins. This algorithm allows the max-gap test to be evaluated in optimal O(N) time and
space, i.e., the same as $\chi^2$, and is as efficiently parallelizable as the $\chi^2$ test (the max-gap and $\chi^2$ tests are both
highly amenable to parallelization with O(N/P) time complexity on P processors).
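The binning idea can be illustrated with a common bucket-based formulation of the linear-time maximum-gap computation. This sketch is written in the spirit of the approach described above and may differ in detail from the algorithm in [20, 21]; the function name `max_gap_linear` is hypothetical.

```python
def max_gap_linear(values):
    """Maximum gap between consecutive-rank values in O(N) time, without sorting."""
    n = len(values)
    if n < 2:
        return 0.0
    lo, hi = min(values), max(values)
    if lo == hi:
        return 0.0
    # The max gap is at least the average gap (hi - lo) / (N - 1), while two
    # values in the same bucket of that width are closer than that, so only
    # gaps between different buckets need to be examined.
    width = (hi - lo) / (n - 1)
    bucket_min = [None] * n
    bucket_max = [None] * n
    for v in values:
        i = min(int((v - lo) / width), n - 1)
        if bucket_min[i] is None:
            bucket_min[i] = bucket_max[i] = v
        else:
            bucket_min[i] = min(bucket_min[i], v)
            bucket_max[i] = max(bucket_max[i], v)
    best = 0.0
    prev_max = None
    for i in range(n):
        if bucket_min[i] is None:
            continue  # skip empty buckets
        if prev_max is not None:
            # Gap between the max of one non-empty bucket and the min of the next.
            best = max(best, bucket_min[i] - prev_max)
        prev_max = bucket_max[i]
    return best
```

For a uniformity test on (0, 1), the endpoints 0 and 1 would be appended to the observations before calling the function. Each pass touches every value or bucket once, and the per-bucket minima and maxima can be combined across partitions of the data, which is what makes the computation straightforward to parallelize.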

The min-gap pair needed to implement a min-gap test can be identified in optimal expected O(N) time and
space using Rabin's randomized closest-pair algorithm [22, 23]. Unlike the Gonzalez algorithm for max-gap, Rabin's
algorithm generalizes efficiently to higher dimensions.


Discussion and Future Work 

We have defined and developed a max-gap test for distinguishing deviations from uniformity in a 1D dataset of size
N. By using Gonzalez's algorithm, we have shown that this test can be performed with efficiency commensurate,
both in serial and in parallel, with that of the conventional $\chi^2$ test. Our experiments demonstrate that the max-gap test
provides improved sensitivity in two particular applications of relevance to data integrity verification. More
generally, the proposed max-gap and min-gap tests are of potential value as alternatives or complements to the use of
$\chi^2$ for distribution testing and discrimination.

There are many statistical tests for equality of distributions beyond the $\chi^2$ test, such as the Kolmogorov-Smirnov test
[24, 25, 26] and the Cramér-von Mises test [27]. Of course there can be no test that is uniformly superior to all others
for all possible distributions, but it appears that most of the standard tests examined in the literature would be
challenged similarly to the $\chi^2$ test to distinguish uniform from regularly distributed data.

Potential future work could consider tests which jointly combine gap and $\chi^2$ statistics into a more sophisticated
single test [28] which allows greater flexibility to optimize the sensitivity and false alarm trade-off for problems of
high practical interest, e.g., big data analytics and integrity verification. On the algorithmic side, we have pointed
out that the Gonzalez algorithm does not generalize to higher dimensions; however, relatively efficient subquadratic 
algorithms do exist for solving the largest empty circle and largest empty rectangle problems in two dimensions 
such as the algorithms in [29, 30]. Tests on 2D distributions could also potentially exploit information about the 
largest empty region of a Voronoi decomposition or the distribution of nearest-neighbor distances from a Delaunay 
triangulation. In d > 2 dimensions, it may be possible to devise gap-related statistical tests based on results from 
efficient algorithms for identifying approximations to the largest empty d-sphere or d-rectangle, but this is purely 
speculative. In higher dimensions, it may be better to abandon gap-type statistics and focus on statistics gleaned 
from efficiently-computable k-d and orthant (quad, octant, etc.) tree decompositions of point sets. 

If computational efficiency is less of a concern, a perhaps more fruitful direction for highly-sensitive distribution 
testing in high dimensions is to examine the length of the Euclidean minimum spanning tree (EMST) for a dataset. 
The expected length of the EMST of uniformly-distributed points can be determined using analysis similar to what
has been described in this paper for estimating the expected values for the max and min gaps in 1D, and we
conjecture that EMST length is likely to be more sensitive to many practically important types of deviations from
uniformity than the conventional $\chi^2$ test. Such an EMST test would be computationally expensive (though
subquadratic), but this cost could be justified in applications for which subtle deviations are critically important, e.g., 
high-fidelity physics simulations. 